Abstract

In this paper, we present a supervised learn- 
ing approach to training submodular scoring 
functions for extractive multi-document summarization.
By taking a structured prediction
approach, we provide a large-margin method 
that directly optimizes a convex relaxation of 
the desired performance measure. The learn- 
ing method applies to all submodular sum- 
marization methods, and we demonstrate its 
effectiveness for both pairwise as well as 
coverage-based scoring functions on multiple 
datasets. Compared to state-of-the-art func- 
tions that were tuned manually, our method 
significantly improves performance and enables
high-fidelity models with numbers of parameters well
beyond what could reasonably be tuned by hand.

1 Introduction 

Automatic document summarization is the prob- 
lem of constructing a short text describing the main 
points in a (set of) document(s). Example appli- 
cations range from generating short summaries of 
news articles, to presenting snippets for URLs in 
web-search. In this paper we focus on extrac- 
tive multi-document summarization, where the final 
summary is a subset of the sentences from multi- 
ple input documents. In this way, extractive summa- 
rization avoids the hard problem of generating well- 
formed natural-language sentences, since only exist- 
ing sentences from the input documents are used. 

A current state-of-the-art method for document
summarization was recently proposed by Lin and
Bilmes [22], using a submodular scoring function
based on inter-sentence similarity. On the one hand, 
this scoring function rewards summaries that are 
similar to many sentences in the original documents 
(i.e. promotes coverage). On the other hand, it 
penalizes summaries that contain sentences that are 
similar to each other (i.e. discourages redundancy). 
While obtaining the exact summary that optimizes 
the objective is computationally hard, they show that 
a greedy algorithm is guaranteed to compute a good 
approximation. However, their work does not ad- 
dress how to select a good inter-sentence similarity 
measure, leaving this problem as well as selecting 
an appropriate trade-off between coverage and re- 
dundancy to manual tuning. 

To overcome this problem, we propose a supervised
learning method that can learn both the similarity
measure as well as the coverage/redundancy
trade-off from training data. Furthermore, our
learning algorithm is not limited to the model of Lin
and Bilmes [22], but applies to all submodular
summarization models. Due to the diminishing-returns
property of submodular set functions and their
computational tractability, this class of functions
provides a rich space for designing summarization
methods. To illustrate this point, we also provide
experiments for a submodular coverage-based model
originally developed for diversified information retrieval
16] . 

In general, our method learns a parameterized
submodular scoring function from supervised training
data, and its implementation is available for
download. Given a set of documents and their
summaries as training examples, we formulate the
learning problem as a structured prediction problem
and derive a maximum-margin algorithm in the
structural SVM framework. Note that, unlike other
learning approaches, our method does not require a
heuristic decomposition of the learning task into
binary classification problems [25], but directly
optimizes a structured prediction. This enables our
algorithm to directly optimize the desired performance
measure (e.g. ROUGE) during training. Furthermore,
our method is not limited to linear-chain
dependencies like [27, 28], but can learn any
submodular scoring function.

This ability to easily train summarization models 
makes it possible to efficiently tune models to vari- 
ous types of document collections. In particular, we 
find that our learning method can reliably tune mod- 
els with hundreds of parameters based on a train- 
ing set of about 30 examples. This increases the fi- 
delity of models compared to their hand-tuned coun- 
terparts, showing significantly improved empirical 
performance. We provide a detailed investigation 
into the sources of these improvements, identifying 
further directions for research. 

2 Related work 

Work on extractive summarization spans a large 
range of approaches. Starting with unsupervised 
methods, one of the widely known approaches is 
MMR [12]. It uses a greedy approach for selec- 
tion and considers the trade-off between relevance 
and redundancy. Later it was extended [13] to
support multi-document settings by incorporating
additional information available in this case. Good
results can be achieved by reformulating this as a
knapsack packing problem and solving it using
dynamic programming [14].

A popular stochastic graph-based summarization
method is LexRank [15]. It computes sentence
importance based on the concept of eigenvector
centrality in a graph of sentence similarities. Similarly,
TextRank [16] is also a graph-based ranking system
for identification of important sentences in a
document by using sentence similarity and PageRank
[17]. Sentence extraction can also be implemented
using other graph-based scoring approaches [18]
such as HITS [19] and positional power functions.
Graph-based methods can also be paired with
clustering such as in CollabSum [20]. This approach
first uses clustering to obtain document clusters and
then uses a graph-based algorithm for sentence
selection which includes inter- and intra-document
sentence similarities. Another clustering-based
algorithm [21] is a diversity-based extension of MMR that
finds diversity by clustering and then proceeds to
reduce redundancy by selecting a representative for
each cluster.

The manually tuned sentence pairwise model
[22, 23] we took inspiration from is based on
budgeted submodular optimization. A summary is
produced by maximizing an objective function that
includes coverage and redundancy terms. Coverage
is defined as the sum of sentence similarities
between the selected summary and the rest of the
sentences, while redundancy is the sum of pairwise
intra-summary sentence similarities. Another
approach based on submodularity [24] relies on
extracting important keyphrases from citation
sentences for a given paper and using them to build the
summary.

In the supervised setting, many early methods
[25] made independent binary decisions whether to
include a particular sentence in the summary or not.
This ignores dependencies between sentences and
can result in high redundancy. The same problem
arises when using learning-to-rank approaches such
as ranking support vector machines, support vector
regression and gradient boosted decision trees to
select the most relevant sentences for the summary
[26].

Introducing some dependencies can improve the 
performance. One limited way of introducing de- 
pendencies between sentences is by using a linear- 
chain HMM. The HMM is assumed to produce the 
summary by having a chain transitioning between 
summarization and non-summarization states [27] 
while traversing the sentences in a document. A 
more expressive approach is using a CRF for se- 
quence labeling [28] which can utilize larger and not 
necessarily independent feature spaces. The disad- 
vantage of using linear chain models, however, is 
that they represent the summary as a sequence of 
sentences. Dependencies between sentences that are
far away from each other cannot be modeled
efficiently. In contrast to such linear-chain models,
our approach based on submodular scoring functions can
model long-range dependencies. In this way our
method can use properties of the whole summary
when deciding which sentences to include in it.

More closely related to our work is that of [29].
They use the diversified retrieval method proposed 
in im for document summarization. Moreover, they 
assume that subtopic labels are available so that ad- 
ditional constraints for diversity, coverage and bal- 
ance can be added to the structural SVM learning 
problem. In contrast, our approach does not require 
the knowledge of subtopics (thus allowing us to ap- 
ply it to a wider range of tasks) and avoids adding 
additional constraints (simplifying the algorithm). 
Furthermore, it can use different submodular objec- 
tive functions, for example word coverage and sen- 
tence pairwise models described later in this paper. 

Another closely related work [9] also takes a
learning approach in the structural SVM framework to
summarize a set of documents. However, they do
not consider submodular functions, but instead solve
an Integer Linear Program (ILP) or an approximation
thereof. The ILP encodes a compression model
where arbitrary parts of the parse trees of sentences
in the summary can be cut and removed. This
allows them to select parts of sentences and yet
preserve some grammatical structure. Their work focuses
on learning a particular compression model, while
our work explores learning a general and large class
of sentence selection models.

3 Submodular document summarization 

In this section, we illustrate how document sum- 
marization can be addressed using submodular set 
functions. The set of documents to be summarized
is split into a set of individual sentences
$x = \{s_1, \ldots, s_n\}$. The summarization method then
selects a subset $\hat{y} \subseteq x$ of sentences that maximizes
a given scoring function $F_x : 2^x \to \mathbb{R}$ subject to a
budget constraint (e.g. less than $B$ characters).

$$\hat{y} = \operatorname*{argmax}_{y \subseteq x} F_x(y) \quad \text{s.t.} \quad |y| \le B \qquad (1)$$

In the following we restrict the admissible scoring 
functions F to be submodular. 

Definition 1. Given a set $x$, a function $F : 2^x \to \mathbb{R}$
is submodular iff for all $u \in x$ and all sets $s$ and $t$
such that $s \subseteq t \subseteq x$, we have

$$F(s \cup \{u\}) - F(s) \ge F(t \cup \{u\}) - F(t).$$

Intuitively, this definition says that adding $u$ to a
subset $s$ of $t$ increases $F$ at least as much as adding
it to $t$. Using two specific submodular functions as
examples, the following sections illustrate how this 
diminishing returns property naturally reflects the 
trade-off between maximizing coverage while mini- 
mizing redundancy. 
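
To make Definition 1 concrete, the following minimal Python sketch checks the diminishing-returns inequality by brute force on a toy ground set. The toy sentences and the helper names (`coverage`, `squared_size`, `is_submodular`) are illustrative assumptions and not part of the method described here; exhaustive checking is only feasible for very small sets.

```python
from itertools import combinations

# Toy ground set (hypothetical): each "sentence" is the set of words it contains.
SENTENCES = {
    "s1": {"storm", "coast"},
    "s2": {"storm", "damage"},
    "s3": {"coast", "evacuation"},
}


def coverage(subset):
    """Number of distinct words covered by the chosen sentences."""
    covered = set()
    for name in subset:
        covered |= SENTENCES[name]
    return len(covered)


def squared_size(subset):
    """Counter-example: |y|^2 has increasing marginal gains."""
    return len(subset) ** 2


def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground):
    """Check F(s + u) - F(s) >= F(t + u) - F(t) for all s within t and all u."""
    ground = list(ground)
    for t in powerset(ground):
        for s in powerset(t):
            for u in ground:
                gain_s = f(s | {u}) - f(s)
                gain_t = f(t | {u}) - f(t)
                if gain_s < gain_t:
                    return False
    return True
```

On this toy data the word-coverage function passes the check, while the squared-size function fails it because its marginal gains grow with the size of the set.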



3.1 Pairwise scoring function 
Figure 1: Illustration of the pairwise model. Not all
edges are shown for clarity purposes. Edge thickness
denotes the similarity score.

The first submodular scoring function we consider
was proposed by [22] based on a model of pairwise
sentence similarities. It scores a summary $y$
using the following function, which [22] shows is
submodular.

$$F_x(y) = \sum_{i \in x \setminus y,\; j \in y} \sigma(i,j) \;-\; \lambda \sum_{i,j \in y:\; i \neq j} \sigma(i,j) \qquad (2)$$

$\sigma(i,j) \ge 0$ denotes a measure of similarity between
pairs of sentences $i$ and $j$. The first term in
Eq. (2) is a measure of how similar the sentences
included in summary $y$ are to the other sentences in
$x$. The second term penalizes $y$ by how similar its
sentences are to each other. $\lambda \ge 0$ is a scalar
parameter that trades off between the two terms.
Maximizing $F_x(y)$ amounts to increasing the similarity
of the summary to excluded sentences while
minimizing repetitions in the summary. An example is
illustrated in Figure 1. In the simplest case, $\sigma(i,j)$
may be the TFIDF IH cosine similarity, but we will 
show later how to learn sophisticated similarity
functions.
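
As an illustration of the pairwise score, here is a minimal Python sketch that evaluates the two terms described above for a given summary. The function name `pairwise_score`, the toy similarity table `SIM`, and the choice to sum redundancy over ordered pairs with i != j are assumptions made for this example.

```python
def pairwise_score(summary, sentences, sim, lam):
    """Coverage of excluded sentences minus lam times intra-summary redundancy."""
    summary = set(summary)
    rest = set(sentences) - summary
    # First term: similarity between excluded sentences and summary sentences.
    coverage = sum(sim(i, j) for i in rest for j in summary)
    # Second term: similarity among summary sentences (ordered pairs, i != j).
    redundancy = sum(sim(i, j) for i in summary for j in summary if i != j)
    return coverage - lam * redundancy


# Toy symmetric similarities between three sentences (hypothetical values).
SIM = {
    frozenset({0, 1}): 0.5,
    frozenset({0, 2}): 0.2,
    frozenset({1, 2}): 0.1,
}


def toy_sim(i, j):
    """Look up the similarity of two distinct sentence ids."""
    return SIM[frozenset({i, j})]
```

With these toy values, selecting sentence 0 alone scores higher than selecting sentences 0 and 1 together (with lam = 0.5), because the two are similar to each other and add little coverage of the remaining sentence.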



3.2 Coverage scoring function 

A second scoring function we consider was first pro- 
posed for diversified document retrieval (T\, but it 
naturally applies to document summarization as well 
[29]. It is based on a notion of word coverage, where
each word $v$ has some importance weight $\omega(v) > 0$.
A summary $y$ covers a word if at least one of its
sentences contains the word. The score of a summary is
then simply the sum of the word weights it covers
(though we could also include a concave discount
function that rewards covering a word multiple times

mi). 

$$F_x(y) = \sum_{v \in V(y)} \omega(v) \qquad (3)$$

$V(y)$ denotes the union of all words in $y$. This
function is analogous to a maximum coverage problem,
which is known to be submodular Q . 
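
The coverage score can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a hypothetical weight table `TOY_WEIGHTS`; unknown words get weight 0 here purely for convenience.

```python
def coverage_score(summary, weight):
    """Sum of importance weights over the distinct words covered by the summary."""
    covered = set()
    for sentence in summary:
        covered.update(sentence.lower().split())
    # Each covered word counts once, no matter how many sentences contain it.
    return sum(weight(v) for v in covered)


# Hypothetical importance weights for a few words.
TOY_WEIGHTS = {"storm": 2.0, "coast": 1.0, "damage": 1.5}


def toy_weight(v):
    """Importance of word v; words outside the table get weight 0."""
    return TOY_WEIGHTS.get(v, 0.0)
```

Because "storm" is counted only once, the score of the two toy sentences "Storm hits coast" and "storm damage" together (4.5) is smaller than the sum of their individual scores (3.0 + 3.5), which is the diminishing-returns behaviour discussed above.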


Figure 2: Illustration of the coverage model. Word bor- 
der thickness represents importance. 

An example of how a summary is scored is
illustrated in Figure 2. Analogous to the definition
of similarity $\sigma(i,j)$ in the pairwise model, the
choice of the word importance function $\omega(v)$ is
crucial in the coverage model. A simple heuristic is to
weigh words highly that occur in many sentences of
X, but in few other documents lH. However, we will 
show in the following how to learn $\omega(v)$ from
training data.

3.3 Computing a Summary 

Computing the summary that maximizes either of
the two scoring functions from above (i.e. Eqns. (2)
and (3)) is NP-hard [14]. However, it is known that
the greedy algorithm shown in Figure 3 can achieve
a $1 - 1/e$ approximation to the optimum solution for
any linear budget constraint [22, 7]. Even further,
this algorithm provides a $1 - 1/e$ approximation for
any monotone submodular scoring function.

    y ← ∅;  A ← x
    while A ≠ ∅ do
        k ← argmax_{l ∈ A} [F_x(y ∪ {l}) − F_x(y)] / (c_l)^r
        if c_k + Σ_{i ∈ y} c_i ≤ B and F_x(y ∪ {k}) − F_x(y) ≥ 0 then
            y ← y ∪ {k}
        end if
        A ← A \ {k}
    end while

Figure 3: Greedy algorithm for finding the best summary
y given a scoring function F_x(y). Values c_i represent
costs of sentences (i.e. lengths).

The algorithm starts with an empty summariza- 
tion. In each step, a sentence is added to the sum- 
mary that results in the maximum relative increase 
of the objective. The increase is relative to the 
amount of budget that is used by the added sen- 
tence. The algorithm terminates when the budget B 
is reached. 

Note that the algorithm has a parameter r in the 
denominator of the selection rule, which [22] report 
to have some impact on performance. Selecting r to 
be less than 1 gives more importance to "informa- 
tion density" (i.e. sentences that have a higher ratio 
of score increase per length). The $1 - 1/e$ greedy
approximation guarantee holds despite this additional
parameter [22]. More details on our choice of r and 
its effects are provided in the experiments section. 
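
The greedy procedure of Figure 3 can be sketched in Python as follows. This is a minimal illustration rather than the released implementation: the function name `greedy_summary`, the toy word sets `WORDS`, the costs `COST`, and the unweighted coverage objective `toy_coverage` are all assumptions made for the example.

```python
def greedy_summary(sentences, cost, score, budget, r=0.3):
    """Greedy selection under a budget, scaling gains by cost ** r.

    sentences: iterable of hashable sentence ids
    cost:      dict mapping id -> positive cost (e.g. length in characters)
    score:     function taking a frozenset of ids and returning a float
    budget:    maximum total cost B
    """
    selected = frozenset()
    used = 0
    candidates = set(sentences)
    while candidates:
        base = score(selected)
        # Candidate with the largest score increase per (cost ** r).
        k = max(candidates,
                key=lambda l: (score(selected | {l}) - base) / cost[l] ** r)
        gain = score(selected | {k}) - base
        # Add k only if it fits in the budget and does not lower the score.
        if used + cost[k] <= budget and gain >= 0:
            selected = selected | {k}
            used += cost[k]
        # k is removed from the candidates whether or not it was added.
        candidates.remove(k)
    return selected


# Toy data (hypothetical): words per sentence and sentence costs.
WORDS = {"a": {"storm", "coast"}, "b": {"storm", "damage", "flood"}, "c": {"coast"}}
COST = {"a": 2, "b": 3, "c": 1}


def toy_coverage(selected):
    """Unweighted word coverage of the selected sentences."""
    covered = set()
    for s in selected:
        covered |= WORDS[s]
    return len(covered)
```

On the toy data with budget 4, the sketch first picks sentence b (largest scaled gain), then c, and finally rejects a because it no longer fits in the budget.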

4 Learning algorithm 

In this section, we propose a supervised learning
method for training a submodular scoring function
to produce desirable summaries. In particular, for
the pairwise and the coverage model, we show how
to learn the similarity function $\sigma(i,j)$ and the word
importance weights $\omega(v)$ respectively. Specifically,
we parameterize $\sigma(i,j)$ and $\omega(v)$ using a linear
model, allowing that each depends on the full set of
input sentences $x$.

$$\sigma_x(i,j) = w^T \phi^p_x(i,j) \qquad \omega_x(v) = w^T \phi^c_x(v) \qquad (4)$$

$w$ is a weight vector that is learned, and $\phi^p_x(i,j)$ and
$\phi^c_x(v)$ are feature vectors. In the pairwise model,
$\phi^p_x(i,j)$ may include features like the TFIDF cosine
between $i$ and $j$ or the number of words from the
document titles that $i$ and $j$ share. In the coverage
model, $\phi^c_x(v)$ may include features like an
indicator of whether $v$ occurs in more than 10% of the
sentences in $x$ or whether $v$ occurs in the document
title.
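
To illustrate the linear parameterization for the coverage model, the sketch below computes a word importance weight as a dot product between a weight vector and a small feature vector. The three features (a constant bias term, the 10% frequency indicator, and the title indicator) and the names `word_features` and `word_weight` are hypothetical choices for this example; the bias term in particular is an assumption not taken from the text.

```python
def dot(w, phi):
    """Inner product of two equal-length sequences."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def word_features(v, sentences, title_words):
    """Hypothetical feature vector for word v: [bias, frequent-in-x, in-title]."""
    # Fraction of sentences (given as sets of words) that contain v.
    frac = sum(v in s for s in sentences) / len(sentences)
    return [
        1.0,
        1.0 if frac > 0.10 else 0.0,
        1.0 if v in title_words else 0.0,
    ]


def word_weight(w, v, sentences, title_words):
    """Word importance as a linear function of the features of v."""
    return dot(w, word_features(v, sentences, title_words))
```

For instance, with 20 toy sentences of which five contain "storm" and one contains "rain", and with "rain" in the title, the weight vector [0.1, 1.0, 0.5] gives "storm" an importance of 1.1 and "rain" an importance of 0.6.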

We propose to learn the weights following a
large-margin framework using structural SVMs.
Structural SVMs learn a discriminant function

$$h(x) = \operatorname*{argmax}_{y} \; w^T \Psi(x, y) \qquad (5)$$

that predicts a structured output $y$ given a (possibly
also structured) input $x$. $\Psi(x, y)$ is called
the joint feature-map between input $x$ and output $y$.
Note that both submodular scoring functions in Eqns.
(2) and (3) can be brought into the form $w^T \Psi(x, y)$
for the linear parametrization in Eq. (6) and (7).

$$\Psi^p(x, y) = \sum_{i \in x \setminus y,\; j \in y} \phi^p_x(i,j) \;-\; \lambda \sum_{i,j \in y:\; i \neq j} \phi^p_x(i,j) \qquad (6)$$

$$\Psi^c(x, y) = \sum_{v \in V(y)} \phi^c_x(v) \qquad (7)$$

After this transformation, it is easy to see that
computing the maximizing summary in Eq. (1) and the
structural SVM prediction rule in Eq. (5) are
equivalent.

To learn the weight vector $w$, structural SVMs
require training examples $(x^1, y^1), \ldots, (x^n, y^n)$ of
input/output pairs. In document summarization,
however, the "correct" extractive summary is
typically not known. Instead, training documents $x^i$
are typically annotated with multiple manual
(non-extractive) summaries (denoted by $Y^i$). To
determine a single extractive target summary $y^i$ for
training, we find the extractive summary that
(approximately) optimizes ROUGE score - or some other
loss function $\Delta(Y^i, y)$ - with respect to $Y^i$.

$$y^i = \operatorname*{argmin}_{y \in \mathcal{Y}} \; \Delta(Y^i, y) \qquad (8)$$

We call the $y^i$ determined in this way the "target"
summary for $x^i$.

    ∀i: W_i ← ∅
    repeat
        for ∀i do
            ŷ ← argmax_y  w^T Ψ(x^i, y) + Δ(Y^i, y)
            if w^T Ψ(x^i, y^i) + ε ≤ w^T Ψ(x^i, ŷ) + Δ(Y^i, ŷ) − ξ_i then
                W_i ← W_i ∪ {ŷ}
                w ← solve QP using constraints W_i
            end if
        end for
    until no W_i has changed during iteration

Figure 4: Cutting-plane algorithm for solving the
learning optimization problem using only a polynomial number
of steps to achieve a requested tolerance ε.

Following the structural SVM approach, we can
now formulate the problem of learning as the
following quadratic program (QP):

$$\min_{w,\; \xi \ge 0} \; \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \qquad (9)$$

$$\text{s.t.} \quad w^T \Psi(x^i, y^i) - w^T \Psi(x^i, \hat{y}^i) \ge \Delta(Y^i, \hat{y}^i) - \xi_i, \quad \forall \hat{y}^i \neq y^i, \; \forall\, 1 \le i \le n.$$

The above formulation ensures that the scoring
function with the target summary (i.e. $w^T \Psi(x^i, y^i)$) is
larger than the scoring function for any other
summary $\hat{y}^i$ (i.e., $w^T \Psi(x^i, \hat{y}^i)$). The objective function
learns a large margin weight vector $w$ while
trading it off with an upper bound on the empirical loss.
The two quantities are traded off with a parameter
$C > 0$.

Even though the QP has exponentially many
constraints in the number of sentences in the input
documents, it can be solved in polynomial time via a
cutting plane algorithm [4]. The steps of the
algorithm are shown in Figure 4. In each iteration of
the algorithm, for each training document $x^i$, a
summary $\hat{y}$ which most violates the constraint in (9) is
found. This is done by solving

$$\hat{y} \leftarrow \operatorname*{argmax}_{y \in \mathcal{Y}} \; w^T \Psi(x^i, y) + \Delta(Y^i, y),$$

which can be done efficiently by the greedy
algorithm in Figure 3. After the worst violating
constraint for each training example is added, the
resulting quadratic program is solved. These steps are
repeated until all the constraints are satisfied to a
required precision $\epsilon$.
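
The search for the most violated constraint and the test that decides whether it enters the working set can be sketched as follows. This is a simplified illustration, not the paper's procedure: it replaces the greedy loss-augmented search with brute-force enumeration over an explicit candidate list, and the toy feature map `psi`, the toy loss `loss`, and all other names are hypothetical.

```python
from itertools import combinations


def linear_score(w, features):
    """Inner product between the weight vector and a joint feature vector."""
    return sum(wi * fi for wi, fi in zip(w, features))


def most_violated(w, psi, loss, candidates):
    """Loss-augmented inference by enumeration: argmax of score plus loss."""
    return max(candidates, key=lambda y: linear_score(w, psi(y)) + loss(y))


def is_violated(w, psi, loss, target, y_hat, slack, eps):
    """True if y_hat violates the margin constraint by more than eps."""
    lhs = linear_score(w, psi(target)) + eps
    rhs = linear_score(w, psi(y_hat)) + loss(y_hat) - slack
    return lhs <= rhs


# Toy problem (hypothetical): three sentences, summaries of one or two sentences.
ITEMS = ("a", "b", "c")
TARGET = frozenset({"a", "b"})
CANDIDATES = [frozenset(c) for r in (1, 2) for c in combinations(ITEMS, r)]


def psi(y):
    """Toy joint feature map: one indicator per sentence."""
    return [1.0 if s in y else 0.0 for s in ITEMS]


def loss(y):
    """Toy loss: normalized symmetric difference to the target summary."""
    return len(y ^ TARGET) / len(ITEMS)
```

With an all-zero weight vector the most violated candidate is simply the one with the highest loss, and it is added to the working set; once the weights separate the target from the alternatives by a sufficient margin, the loss-augmented maximizer is the target itself and no constraint is added.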

Finally, special care has to be taken to
appropriately define the loss function $\Delta$ given the disparity
of $Y^i$ and $y^i$. Therefore, we first define an
intermediate loss function as follows:

$$\Delta_R(Y, y) = \max(0,\; 1 - \mathrm{ROUGE1}_F(Y, y)),$$

based on the (slightly simplified) ROUGE-1 F score,
which is a standard metric for measuring the quality
of a document summarization. To ensure that the
loss function is zero for the target label as defined in
(8), we normalized the above loss as below:

$$\Delta(Y^i, y) = \max(0,\; \Delta_R(Y^i, y) - \Delta_R(Y^i, y^i)).$$

The above loss $\Delta$ was used in our experiments. Thus
training a structural SVM with this loss maximizes
the ROUGE-1 F score with the true manual
summaries provided in the training examples while
trading it off with margin. Note that we could easily use
a different loss function (as the method is not tied
to this particular choice) if we had a different
target evaluation metric. Finally, once a $w$ is obtained
from the structural SVM training, a prediction
summary for a test document $x$ can be easily obtained
from Eq. (5).
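
The two-stage loss can be sketched in Python as below. This is a rough illustration only: `rouge1_f` is a simplified unigram-overlap F score with whitespace tokenization and is not the official ROUGE 1.5.5 tool, and averaging over the manual summaries is an assumption made for this example.

```python
from collections import Counter


def rouge1_f(reference, candidate):
    """Simplified unigram-overlap F score between two whitespace-tokenized texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped unigram matches: multiset intersection of the two counters.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def delta_r(references, candidate):
    """Intermediate loss: one minus the average F score, clipped at zero."""
    avg = sum(rouge1_f(r, candidate) for r in references) / len(references)
    return max(0.0, 1.0 - avg)


def delta(references, candidate, target):
    """Normalized loss: the target summary has loss zero by construction."""
    return max(0.0, delta_r(references, candidate) - delta_r(references, target))
```

The normalization matters when even the best extractive summary cannot match the manual summaries exactly: subtracting the target's own intermediate loss keeps the loss of the target at zero, so the margin constraints remain satisfiable with zero slack.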

5 Experiments 

In this section, we empirically evaluate the approach
proposed in this paper. Following [22], experiments
were conducted on two different datasets (DUC '03
and '04). These datasets contain document sets with
four manual summaries for each set. For each
document set, we concatenated all the articles and split
them into sentences using the tool provided with the
'03 dataset. For the supervised setting we used 10
resamplings with a random 20/5/5 ('03) and 40/5/5
('04) train/test/validation split. We determined the
best C value using the performance on each
validation set and then report average performance over
the corresponding test sets. Baseline performance
(the approach of [22]) was computed using all 10
test sets as a single test set. For all experiments and
datasets, we used r = 0.3 in the greedy algorithm
as recommended in [22] for the '03 dataset. We find
that changing r has only a small influence on
performance.
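
The resampling protocol can be sketched as follows. This is an assumed reading of the setup (independent random shuffles, disjoint train/test/validation parts within each resampling); the function name `resample_splits`, the fixed seed, and the integer document-set ids are hypothetical.

```python
import random


def resample_splits(doc_sets, n_train, n_test, n_val, n_resamplings=10, seed=0):
    """Return a list of (train, test, validation) splits from random shuffles."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_resamplings):
        shuffled = list(doc_sets)
        rng.shuffle(shuffled)
        train = shuffled[:n_train]
        test = shuffled[n_train:n_train + n_test]
        val = shuffled[n_train + n_test:n_train + n_test + n_val]
        splits.append((train, test, val))
    return splits
```

Under this reading, the C value would be chosen per resampling on the validation part and the reported number is the average test score over the 10 resamplings.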

The construction of features for learning is
organized by word groups. The most trivial group is
simply all words (basic). Considering the
properties of the words themselves, we constructed
several features from properties such as capitalized
words, words of certain length and non-stop words
(cap+stop+len). We obtained another set of
features from the most frequently occurring words in all
the articles (minmax). We also considered the
position of a sentence (containing the word) in the
article as another feature (location). All those word
groups can then be further refined by selecting
different thresholds, weighting schemes (e.g. TFIDF)
and forming binned variants of these features.

For the pairwise model we use cosine similarity
between sentences using only words in a given
word group during computation. For the word
coverage model we create separate features for
covering words in different groups. This gives us fairly
comparable feature strength in both models. The
only further addition is the use of different word
coverage levels in the coverage model. First, we
consider how well a sentence covers a word (e.g. a
sentence with five instances of the same word might
cover it better than another with only a single
instance). Second, we look at how important it
is to cover a word (e.g. if a word appears in a large
fraction of sentences we might want to be sure to
cover it). Combining those two criteria using
different thresholds we get a set of features for each
word. Our coverage features are motivated by the
approach of [2]. In contrast, the hand-tuned pairwise
baseline uses only TFIDF weighted cosine similarity
between sentences using all words, following the
approach in [22].

The resulting summaries are evaluated using
ROUGE version 1.5.5 [3]. We selected the ROUGE-1
F measure because it was used by [22] and because
it is one of the commonly used performance scores
in recent work. However, our learning method
applies to other performance measures as well. Note
that we use the ROUGE-1 F measure both for the
loss function during learning, as well as for the
evaluation of the predicted summaries.

Footnote: Setting r to 1 and thus eliminating the non-linearity does
lower the score (e.g. to 0.38466 for the pairwise model on DUC
'03, compared with the results in Figure 5).



5.1 How does learning compare to manual tuning?

In our first experiment, we compare our supervised
learning approach to the hand-tuned approach. The
results from this experiment are summarized in
Figure 5. First, supervised training of the pairwise
model [22] resulted in a statistically significant (p <
0.05) increase in performance on both datasets
compared to our reimplementation of the manually tuned
pairwise model. Note that our reimplementation of
the approach of [22] resulted in slightly different
performance numbers than those reported in [22] -
better on DUC '03 and somewhat lower on DUC
'04, if evaluated on the same selection of test
examples as theirs. We conjecture that this is due to small
differences in implementation and/or preprocessing
of the dataset. Furthermore, as the authors of [22] note
in their paper, the '03 and '04 datasets behave quite
differently.
Figure 5: Results obtained on DUC '03 and '04 datasets
using the supervised models. Increase in performance
over the hand-tuned baseline is statistically significant
(p < 0.05) for the pairwise model on both datasets, but
only on DUC '03 for the coverage model.

Figure 5 also reports the performance for the
coverage model as trained by our algorithm. These
results can be compared against those for the
pairwise model. Since we are using features of
comparable strength in both approaches, as well as the
same greedy algorithm and structural SVM learning
method, this comparison largely reflects the quality
of the models themselves. On the '04 dataset both
models achieve the same performance while on '03 the
pairwise model performs significantly (p < 0.05)
better than the coverage model.

Overall, the pairwise model appears to perform
slightly better than the coverage model with the
datasets and features we used. Therefore, we focus
on the pairwise model in the following.

Figure 6: Learning curve for the pairwise model on
the DUC '04 dataset showing ROUGE-1 F scores for
different numbers of learning examples (logarithmic scale).
The dashed line represents the performance of the
hand-tuned model.

5.2 How fast does the algorithm learn? 

Hand-tuned approaches have limited flexibility. 
Whenever we move to a significantly different col- 
lection of documents we have to reinvest time to 
retune it. Learning can make this adaptation to a 
new collection more automatic and faster - espe- 
cially since training data has to be collected even for 
manual tuning. 

Figure 6 evaluates how effectively the learning
algorithm can make use of a given amount of training
data. In particular, the figure shows the learning
curve for our approach. Even with very few training 
examples the learning approach already outperforms 
the baseline. Furthermore, at the maximum number 
of training examples available to us the curve still 
increases. We therefore conjecture that more data 
would further improve performance. 

5.3 Where is room for improvement? 

To get a rough estimate of what is actually
achievable in terms of the final ROUGE-1 F score we
looked at different "upper bounds" under various
scenarios (Figure 7). First, ROUGE score is
computed by using four manual summaries from
different assessors, so that we can estimate inter-subject
disagreement. If one computes the ROUGE score
of a held-out summary against the remaining three
summaries, the resulting performance is given in the
row human of Figure 7. It provides a reasonable
estimate of human performance.

Second, in extractive summarization we restrict
summaries to sentences from the documents
themselves, which is likely to lead to a reduction in
ROUGE. To estimate this drop, we use the greedy
algorithm to select the extractive summary that
maximizes ROUGE on the test documents. The resulting
performance is given in the row extractive of
Figure 7. On both datasets, the drop in performance for
this (approximately) optimal extractive summary is
about 10 points of ROUGE.

Third, we expect some drop in performance, since
our model may not be able to fit the optimal
extractive summaries due to a lack of expressiveness. This
can be estimated by looking at training set
performance, as reported in row model fit of Figure 7. On
both datasets, we see a drop of about 5 points of
ROUGE performance. Adding more and better
features might help the model fit the data better.

Finally, a last drop in performance may come
from overfitting. The test set ROUGE scores are
given in the row prediction of Figure 7. Note that
the drop between training and test performance is
rather small, so overfitting is not an issue and is well
controlled in our algorithm. We therefore conclude
that increasing model fidelity seems like a promising
direction for further improvements.

5.4 Which features are most useful? 

To understand which features affected the final
performance of our approach, we assessed the strength
of each set of our features. In particular, we looked
at how the final test score changes when we removed
certain feature groups (described in the beginning
of Section 5), as shown in Figure 8.

The most important group of features are the basic
features (pure cosine similarity between sentences)
since removing them results in the largest drop in
performance. However, other features play a
significant role too (i.e. the basic ones alone are not
enough to achieve good performance). This
confirms that performance can be improved by adding
richer features instead of using only a single
similarity score as in [22]. Using learning for these
complex models is essential, since hand-tuning is likely
to be intractable.

Figure 7: Upper bounds on ROUGE-1 F scores:
agreement between manual summaries, greedily computed
best extractive summaries, best model fit on the train set
(using the best C value) and the test scores of the pairwise
model.

The second most important group of features
considering the drop in performance (i.e. location)
looks at positions of sentences in the articles. This
makes intuitive sense because the first sentences in
news articles are usually packed with information. The
other three groups do not have a significant impact
on their own.
Footnote: We compared the greedy algorithm with exhaustive search
for up to three selected sentences (more than that would take
too long). In about half the cases we got the same solution; in
the other cases the solution was on average about 1% below optimal,
confirming that greedy selection works quite well.



Figure 8: Effects of removing different feature groups
on the DUC '04 dataset. Bold font marks significant
difference (p < 0.05) when compared to the full pairwise
model. The most important are basic similarity features
including all words (similar to [22]). The last feature
group actually lowered the score but is included in the
model because we only found this out later on the DUC '04
dataset.



5.5 How important is it to train with multiple 
summaries? 

While having four manual summaries may be impor- 
tant for computing a reliable ROUGE score for eval- 
uation, it is not clear whether such an approach is the 
most efficient use of annotator resources for training. 
In our final experiment, we trained our method using 
only a single manual summary for each set of docu- 
ments. When using only a single manual summary, 
we arbitrarily took the first one out of the provided 
four reference summaries and used only it to com- 
pute the target label for training (instead of using 
average loss towards all four of them). Otherwise, 
the experimental setup was the same as in the previ- 
ous subsections, using the pairwise model. 

For DUC '04, the ROUGE-1 F score obtained using
only a single summary per document set was
0.4010, which is slightly but not significantly lower
than the 0.4066 obtained with four summaries (as
shown in Figure 5). Similarly, on DUC '03 the
performance drop from 0.3929 to 0.3838 was also not
significant.

Based on those results, we conjecture that having
more document sets with only a single manual
summary is more useful for training than fewer
training examples with better labels (i.e. multi- 
ple summaries). In both cases, we spend approxi- 
mately the same amount of effort (as the summaries 
are the most expensive component of the training 
data), however having more training examples helps 
(according to the learning curve presented before) 
while spending effort on multiple summaries ap- 
pears to have only minor benefit for training. 

6 Conclusions 

This paper presented a supervised learning
approach to extractive document summarization based
on structural SVMs. The learning method applies
to all submodular scoring functions, ranging from
pairwise-similarity models to coverage-based
approaches. The learning problem is formulated as
a convex quadratic program and then solved
approximately using a cutting-plane method. In an
empirical evaluation, the structural SVM approach
significantly outperforms conventional hand-tuned models
on the DUC '03 and '04 datasets. A key advantage
of the learning approach is its ability to handle large
numbers of features, providing substantial flexibility
for building high-fidelity summarization models.
Furthermore, it shows good control of overfitting,
making it possible to train models even with only
a few training examples.